An Information Theoretic approach to Post Randomization Methods under Differential Privacy
Post Randomization Methods (PRAM) are among the most popular disclosure limitation techniques for both categorical and continuous data. In the categorical case, given a stochastic matrix M and a specified variable, an individual belonging to category i is changed to category j with probability M_{i,j}. Every approach to choosing the randomization matrix M has to balance two desiderata: 1) preserving as much statistical information from the raw data as possible; 2) guaranteeing the privacy of individuals in the dataset. This trade-off has generally proven very challenging to resolve. In this work, we use recent tools from the computer science literature and propose to choose M as the solution of a constrained maximization problem: we maximize the Mutual Information between raw and transformed data, subject to the constraint that the transformation satisfies the notion of Differential Privacy. For the general categorical model, we show how this maximization problem reduces to a linear program and can therefore be solved with known optimization algorithms.
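To make the two sides of this trade-off concrete, the toy sketch below (our illustration, not the paper's optimization routine; the prior p, the matrix M, and the function names are assumptions) computes the mutual information retained by a candidate PRAM matrix and checks the epsilon-differential-privacy constraint M_{i,j} <= e^eps * M_{k,j} for all categories i, k, j.

```python
import numpy as np

def mutual_information(p, M):
    """I(X; Y) in nats for input distribution p and PRAM channel M.

    p[i]   : prior probability of category i
    M[i,j] : probability that category i is released as category j
    """
    joint = p[:, None] * M          # joint distribution of (X, Y)
    py = joint.sum(axis=0)          # marginal of the released category Y
    with np.errstate(divide="ignore", invalid="ignore"):
        ratio = np.where(joint > 0, joint / (p[:, None] * py[None, :]), 1.0)
    return float((joint * np.log(ratio)).sum())

def satisfies_dp(M, eps):
    """Check M[i,j] <= exp(eps) * M[k,j] for all i, k, j (eps-DP)."""
    return bool(np.all(M.max(axis=0) <= np.exp(eps) * M.min(axis=0) + 1e-12))

# Toy example: three categories, a uniform prior, and a symmetric
# randomized-response matrix (a classical eps-DP channel).
eps = 1.0
k = 3
stay = np.exp(eps) / (np.exp(eps) + k - 1)
move = 1.0 / (np.exp(eps) + k - 1)
M = np.full((k, k), move) + np.eye(k) * (stay - move)
p = np.full(k, 1.0 / k)

print(satisfies_dp(M, eps))        # True: the channel meets the privacy constraint
print(mutual_information(p, M))    # statistical utility retained by this M
```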
Survival analysis via hierarchically dependent mixture hazards
Hierarchical nonparametric processes are popular tools for defining priors on collections of probability distributions, which induce dependence across multiple samples. In survival analysis problems, one is typically interested in modeling the hazard rates rather than the probability distributions themselves, and the currently available methodologies do not apply. Here, we fill this gap by introducing a novel, analytically tractable class of multivariate mixtures whose distribution acts as a prior for the vector of sample-specific baseline hazard rates. The dependence is induced through a hierarchical specification of the mixing random measures, which ultimately corresponds to a composition of random discrete combinatorial structures. Our theoretical results allow us to develop a full Bayesian analysis for this class of models, which can also account for right-censored survival data and covariates, and we establish posterior consistency. In particular, we emphasize that the posterior characterization we achieve is the key to devising both marginal and conditional algorithms for evaluating Bayesian inferences of interest. The effectiveness of our proposal is illustrated through synthetic and real data examples.
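For intuition on the hazard-mixture representation itself (a minimal sketch under illustrative assumptions, not the paper's hierarchical prior: the gamma-density kernels, weights, and parameters below are all made up), a mixture hazard h(t) = sum_k w_k K(t; theta_k) determines the survival function through S(t) = exp(-H(t)), with H the cumulative hazard.

```python
import numpy as np
from scipy.stats import gamma

# Illustrative mixing weights and kernel parameters (assumptions, not fitted values).
weights = np.array([0.5, 0.3, 0.2])
shapes  = np.array([2.0, 5.0, 9.0])
rates   = np.array([1.0, 1.5, 0.8])

def hazard(t):
    """Mixture hazard h(t) = sum_k w_k * gamma_pdf(t; a_k, rate_k)."""
    t = np.atleast_1d(t)
    return (weights * gamma.pdf(t[:, None], a=shapes, scale=1.0 / rates)).sum(axis=1)

def survival(t):
    """S(t) = exp(-H(t)); the cumulative hazard of a gamma-pdf kernel is its cdf.

    Note: with finitely many kernels, H(inf) = sum(weights), so S(inf) > 0;
    nonparametric mixing measures with infinite total mass avoid this.
    """
    t = np.atleast_1d(t)
    H = (weights * gamma.cdf(t[:, None], a=shapes, scale=1.0 / rates)).sum(axis=1)
    return np.exp(-H)

ts = np.linspace(0.0, 10.0, 5)
print(hazard(ts))
print(survival(ts))
```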
More for less: Predicting and maximizing genetic variant discovery via Bayesian nonparametrics
While the cost of sequencing genomes has decreased dramatically in recent years, this expense often remains non-trivial. Under a fixed budget, then, scientists face a natural trade-off between quantity and quality; they can spend resources to sequence a greater number of genomes (quantity) or spend resources to sequence genomes with increased accuracy (quality). Our goal is to find the optimal allocation of resources between quantity and quality. Optimizing resource allocation promises to reveal as many new variations in the genome as possible, and thus as many new scientific insights as possible. In this paper, we consider the common setting where scientists have already conducted a pilot study to reveal variants in a genome and are contemplating a follow-up study. We introduce a Bayesian nonparametric methodology to predict the number of new variants in the follow-up study based on the pilot study. When experimental conditions are kept constant between the pilot and follow-up, we demonstrate on real data from the gnomAD project that our prediction is more accurate than three recent proposals and competitive with a more classic proposal. Unlike existing methods, though, our method allows practitioners to change experimental conditions between the pilot and the follow-up. We demonstrate how this distinction allows our method to be used for (i) more realistic predictions and (ii) optimal allocation of a fixed budget between quality and quantity.
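For intuition on this kind of extrapolation, the sketch below uses the closed-form expected number of distinct variants under a plain Pitman-Yor process; the paper's methodology is richer (and handles changing experimental conditions), and the hyperparameter values and the helper expected_new_variants are illustrative assumptions. With discount sigma and concentration theta, E[K_n] = (theta/sigma) * ((theta+sigma)_{n} / (theta)_{n} - 1), where (a)_{n} = Gamma(a+n)/Gamma(a) is the rising factorial.

```python
import numpy as np
from scipy.special import gammaln

def expected_distinct(n, sigma, theta):
    """E[K_n] under a Pitman-Yor(sigma, theta) process."""
    log_ratio = (gammaln(theta + sigma + n) - gammaln(theta + sigma)
                 - gammaln(theta + n) + gammaln(theta))
    return (theta / sigma) * np.expm1(log_ratio)

def expected_new_variants(n_pilot, m_followup, sigma, theta):
    """Predicted number of variants seen in n_pilot + m_followup samples
    but not in the first n_pilot."""
    return (expected_distinct(n_pilot + m_followup, sigma, theta)
            - expected_distinct(n_pilot, sigma, theta))

# Illustrative hyperparameters; in practice (sigma, theta) would be fit
# to the variant counts observed in the pilot study.
print(expected_new_variants(n_pilot=1000, m_followup=4000, sigma=0.6, theta=50.0))
```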
Nonparametric Bayesian multi-armed bandits for single cell experiment design
The problem of maximizing cell type discovery under budget constraints is a fundamental challenge for the collection and analysis of single-cell RNA-sequencing (scRNA-seq) data. In this paper, we introduce a simple, computationally efficient, and scalable Bayesian nonparametric sequential approach to optimize the budget allocation when designing a large-scale experiment for the collection of scRNA-seq data for the purpose of, but not limited to, creating cell atlases. Our approach relies on the following tools: i) a hierarchical Pitman-Yor prior that recapitulates biological assumptions regarding cellular differentiation, and ii) a Thompson sampling multi-armed bandit strategy that balances exploitation and exploration to prioritize experiments across a sequence of trials. Posterior inference is performed using a sequential Monte Carlo approach, which allows us to fully exploit the sequential nature of our species sampling problem. We empirically show that our approach outperforms state-of-the-art methods and achieves near-oracle performance on simulated and scRNA-seq data alike. HPY-TS code is available at https://github.com/fedfer/HPYsinglecell.
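To make the bandit component concrete, here is a toy Thompson sampling loop in which each arm (say, a tissue to sequence next) yields a reward when a newly sequenced cell reveals a new cell type. Everything below is a simplifying assumption: Beta-Bernoulli arms with fixed discovery probabilities stand in for the paper's hierarchical Pitman-Yor predictive model and sequential Monte Carlo inference.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-arm probabilities that one more cell reveals a new cell
# type (in reality these decay as types are exhausted; the HPY model
# captures that, this toy does not).
true_discovery = np.array([0.30, 0.12, 0.05])
n_arms = len(true_discovery)

# Beta(1, 1) priors on each arm's discovery probability.
alpha = np.ones(n_arms)
beta = np.ones(n_arms)

budget = 500
discoveries = 0
for _ in range(budget):
    # Thompson sampling: draw one posterior sample per arm, play the argmax.
    sampled = rng.beta(alpha, beta)
    arm = int(np.argmax(sampled))
    reward = rng.random() < true_discovery[arm]  # new cell type found?
    alpha[arm] += reward
    beta[arm] += 1 - reward
    discoveries += reward

print(f"new cell types found (toy count): {discoveries}")
print("posterior means per arm:", alpha / (alpha + beta))
```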
Mixture modeling via vectors of normalized independent finite point processes
Statistical modeling in the presence of hierarchical data is a crucial task in Bayesian statistics. The Hierarchical Dirichlet Process (HDP) represents the foremost tool for handling data organized in groups through mixture modeling. Although the HDP is mathematically tractable, its computational cost is typically demanding, and its analytical complexity represents a barrier for practitioners. The present paper introduces a mixture model based on a novel family of Bayesian priors designed for multilevel data and obtained by normalizing a finite point process. A full distributional theory for this new family and the induced clustering is developed, including tractable expressions for marginal, posterior, and predictive distributions. Efficient marginal and conditional Gibbs samplers are designed to provide posterior inference. The proposed mixture model improves on the HDP in terms of analytical feasibility, clustering discovery, and computational time. The motivating application comes from the analysis of shot put data, which contains performance measurements of athletes across different seasons. In this setting, the proposed model is exploited to induce clustering of the observations across seasons and athletes. By linking clusters across seasons, similarities and differences in athletes' performances are identified.
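As a rough picture of the prior's building block (a loose sketch under stated assumptions, not the paper's construction: the Poisson number of atoms, the Gamma scores, and the normal base measure are all illustrative choices), one can draw a random discrete probability measure by sampling finitely many atoms, attaching positive scores to them, and normalizing.

```python
import numpy as np

rng = np.random.default_rng(1)

def draw_normalized_finite_pp(lam=5.0, gamma_shape=1.0):
    """Draw one random probability measure by normalizing a finite point process.

    Illustrative choices: a Poisson(lam)+1 number of atoms, i.i.d. Gamma
    scores as unnormalized masses, and a standard normal base measure for
    the atom locations.
    """
    m = rng.poisson(lam) + 1                 # random, but almost surely finite
    scores = rng.gamma(gamma_shape, size=m)  # unnormalized masses
    atoms = rng.normal(size=m)               # atom locations from the base measure
    return atoms, scores / scores.sum()

# Grouped data: each group gets its own draw; a hierarchical version would
# share atoms across groups, inducing clustering both within and between groups.
for g in range(3):
    atoms, weights = draw_normalized_finite_pp()
    data = rng.choice(atoms, size=10, p=weights) + rng.normal(scale=0.1, size=10)
    print(f"group {g}: {len(atoms)} atoms, sample mean {data.mean():.2f}")
```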